Abstract

Multi-armed bandit problems are considered as a paradigm of the trade-off between exploring the environment
to find profitable actions and exploiting what is already known. In the stationary case, the distributions
of the rewards do not change in time. Upper-Confidence Bound (UCB) policies, proposed in Agrawal (1995)
and later analyzed in Auer et al. (2002), have been shown to be rate optimal.

A challenging variant of the multi-armed bandit problem (MABP) is the non-stationary bandit problem where
the gambler must decide which arm to play while facing the possibility of a changing environment. In this
paper, we consider the situation where the distributions of rewards remain constant over epochs and change
at unknown time instants. We analyze two algorithms: the discounted UCB and the sliding-window UCB. We
establish for these two algorithms an upper-bound for the expected regret by upper-bounding the expectation
of the number of times a suboptimal arm is played. For that purpose, we derive a Hoeffding type inequality
for self-normalized deviations with a random number of summands. We establish a lower-bound for the regret
in presence of abrupt changes in the arms' reward distributions. We show that the discounted UCB and the
sliding-window UCB both match the lower-bound up to a logarithmic factor.

Keywords: Multi-armed bandit, reinforcement learning, deviation inequalities, non-stationary environment 



1. Introduction 

Multi-armed bandit (MAB) problems, modelling allocation issues under uncertainty, are fundamental to 
stochastic decision theory. The archetypal MAB problem may be stated as follows: there is a bandit with
$K$ independent arms. At each time step, the player can play only one arm and receive a reward. In the
stationary case, the distributions of the rewards are initially unknown, but are assumed to remain constant
during all games. The player iteratively plays one action (pulls an arm) per round, observes the associated 
reward, and decides on the action for the next iteration. The goal of a MAB algorithm is to minimize 
the expected regret over T rounds, which is defined as the expectation of the difference between the total 
reward obtained by playing the best arm and the total reward obtained by using the algorithm (or policy). 
The minimization of the regret is achieved by balancing exploitation, the use of acquired information, with 
exploration, acquiring new information. If the player always plays the arm which he currently believes to
be the best, he might fail to identify another arm having an actually higher expected reward. On the other
hand, if the gambler explores the environment too often to find profitable actions, he will fail to accumulate
as many rewards as he could. For several algorithms in the literature (e.g. Lai and Robbins (1985); Agrawal 
(1995)), as the number of plays $T$ goes to infinity, the expected total reward asymptotically approaches that
of playing a policy with the highest expected reward, and the regret grows as the logarithm of $T$. More
recently, finite-time bounds for the regret have been derived (see Auer et al. (2002); Audibert et al. (2007)). 

Though the stationary formulation of the MABP makes it possible to address exploration versus exploitation
challenges in an intuitive and elegant way, it may fail to be adequate to model an evolving environment where
the reward distributions undergo changes in time. As an example, in the cognitive medium radio access
problem (Lai et al., 2007), a user wishes to opportunistically exploit the availability of an empty channel in a
multiple-channel system; the reward is the availability of the channel, whose distribution is unknown to the
user. Another application is real-time optimization of websites by targeting relevant content at individuals,
and maximizing the general interest by learning and serving the most popular content (such situations have
been considered in the recent Exploration versus Exploitation (EvE) PASCAL challenge by Hartland et al.
(2006), see also Koulouriotis and Xanthopoulos (2008) and the references therein). These examples illustrate
the limitations of the stationary MAB models. The probability that a given channel is available is likely
to change in time. The news stories a visitor of a website is most likely to be interested in vary in time.

To model such situations, we need to consider non-stationary MAB problems, where distributions of 
rewards may change in time. We show in the following that, as expected, policies tailored for the stationary 
case fail to track changes of the best arm. In this paper, we consider a particular non-stationary case where 
the distributions of the rewards undergo abrupt changes. We derive a lower-bound for the regret of any 
policy, and we analyze two algorithms: the Discounted UCB (Upper Confidence Bound) proposed by Kocsis
and Szepesvari and the Sliding Window UCB we introduce. We show that they are almost rate-optimal, as 
their regret almost matches a lower-bound. 

1.1 The stationary MAB problem 

At each time $s$, the player chooses an arm $I_s \in \{1, \dots, K\}$ to play according to a (deterministic or random)
policy $\pi$ based on the sequence of past plays and rewards, and obtains a reward $X_s(I_s)$ (see footnote 1). The rewards
$\{X_s(i)\}_{s \ge 1}$ for each arm $i \in \{1, \dots, K\}$ are modeled by a sequence of independent and identically
distributed (i.i.d.) random variables from a distribution unknown to the player. We denote by $\mu(i)$ the
expectation of the reward $X_1(i)$.

The optimal (oracle) policy $\pi^*$ consists in always playing the arm $i^* \in \{1, \dots, K\}$ with largest expected
reward

$$\mu(*) = \max_{1 \le i \le K} \mu(i), \qquad i^* = \arg\max_{1 \le i \le K} \mu(i).$$

The performance of a policy $\pi$ is measured in terms of regret in the first $T$ plays, which is defined as
the expected difference between the total rewards collected by the optimal policy $\pi^*$ (playing at each time
instant the arm $i^*$ with the highest expected reward) and the total rewards collected by the policy $\pi$.

Denote by $N_t(i) = \sum_{s=1}^{t} \mathbb{1}_{\{I_s = i\}}$ the number of times arm $i$ has been played in the $t$ first games. The
expected regret after $T$ plays may be expressed as:

$$\mathbb{E}_\pi\Big[\sum_{t=1}^{T} \{\mu(*) - \mu(I_t)\}\Big] = \sum_{i \ne i^*} \{\mu(*) - \mu(i)\}\, \mathbb{E}_\pi[N_T(i)],$$

where $\mathbb{E}_\pi$ denotes the expectation under policy $\pi$.



1. Note that we use here the convention that the reward at time $s$ if the $i$-th arm is played is supposed to be $X_s(i)$ and
not the $N_s(i)$-th reward in the sequence of rewards for arm $i$, where $N_s(i)$ denotes the number of times the arm $i$ has been
played up to time $s$; while this convention makes no difference in the stationary case, because the rewards are
independent and identically distributed, it is meaningful in the non-stationary case, since the distribution of the arm may
change even if the arm has not been played. These models can be seen as a special instance of the so-called restless bandit,
proposed by Whittle (1988).

Obviously, bounding the expected regret after $T$ plays essentially amounts to controlling the expected
number of times a sub-optimal arm is played. In their seminal paper, Lai and Robbins (1985) consider a
stationary MAB problem, in which the distribution of rewards is taken from a one-dimensional parametric
family (each arm being associated with a different value of the parameter, unknown to the player). They
proposed a policy achieving a logarithmic regret. Furthermore, they established a lower-bound for
the regret of policies satisfying an appropriately defined consistency condition, and showed that their policy
is asymptotically efficient. Later, the non-parametric context has been considered; several algorithms
have been proposed, among which softmax action selection policies and Upper-Confidence Bound (UCB)
policies.

Softmax methods are randomized policies where, at time $t$, the arm $I_t$ is chosen at random by the player
according to some probability distribution giving more weight to arms which have so far performed well.
The greedy action is given the highest selection probability, but all the others are ranked and weighted
according to their accumulated rewards. The most common softmax action selection method uses a Gibbs,
or Boltzmann, distribution. A prototypal example of softmax action selection is the so-called EXP3 policy
(for Exponential-weight algorithm for Exploration and Exploitation), which has been introduced by Freund
and Schapire (1997) for solving a worst-case sequential allocation problem and thoroughly examined as
an instance of the "prediction with limited feedback" problem in Chapter 6 of Cesa-Bianchi and Lugosi (2006)
(see also Auer et al. (2002/03); Cesa-Bianchi and Lugosi (1999)).

UCB methods are deterministic policies extending the algorithm proposed by Lai and Robbins (1985)
to a non-parametric context; they have been introduced and analyzed by Agrawal (1995). They consist in
playing during the $t$-th round the arm $i$ that maximizes the upper bound of a confidence interval for the
expected reward, which is constructed from the past observed rewards. The most popular, called UCB-1, relies
on the upper-bound $\bar{X}_t(i) + c_t(i)$, where $\bar{X}_t(i) = (N_t(i))^{-1} \sum_{s=1}^{t} X_s(i) \mathbb{1}_{\{I_s = i\}}$ denotes the empirical
mean, and $c_t(i)$ is a padding function. A standard choice is $c_t(i) = B \sqrt{\xi \log(t) / N_t(i)}$, where $B$ is an
upper-bound on the rewards and $\xi > 0$ is some appropriate constant. UCB-1 is defined in Algorithm 1.



Algorithm 1 UCB-1

for $t$ from $1$ to $K$, play arm $I_t = t$;
for $t$ from $K + 1$ to $T$, play arm

$$I_t = \arg\max_{1 \le i \le K} \bar{X}_t(i) + c_t(i).$$







UCB-1 belongs to the family of "follow the perturbed leader" algorithms, and has proven to retain 
the optimal logarithmic rate (but with suboptimal constant). A finite-time analysis of this algorithm has 
been given in Auer et al. (2002); Auer (2002); Auer et al. (2002/03). Other types of padding functions are 
considered in Audibert et al. (2007). 
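
As an illustration, the index rule of Algorithm 1 can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `ucb1`, the callback signature `reward_fn(t, arm, rng)` and the 0-based arm numbering are hypothetical conveniences, not part of the paper.

```python
import math
import random


def ucb1(reward_fn, n_arms, horizon, bound=1.0, xi=0.5, seed=0):
    """Run UCB-1 for `horizon` rounds and return the list of arms played.

    reward_fn(t, arm, rng) must return a reward in [0, bound]; this
    callback interface is an assumption made for the sketch. Arms are
    numbered from 0 here, whereas the text numbers them from 1.
    """
    rng = random.Random(seed)
    counts = [0] * n_arms   # N_t(i): number of plays of arm i so far
    sums = [0.0] * n_arms   # cumulated reward collected on arm i
    plays = []
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # initialization: play each arm once
        else:
            def index(i):
                mean = sums[i] / counts[i]                        # empirical mean
                padding = bound * math.sqrt(xi * math.log(t) / counts[i])
                return mean + padding                             # upper-confidence bound
            arm = max(range(n_arms), key=index)
        reward = reward_fn(t, arm, rng)
        counts[arm] += 1
        sums[arm] += reward
        plays.append(arm)
    return plays
```

Ties in the argmax are broken in favour of the smallest arm index, which is one arbitrary choice among many.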



1.2 The non-stationary MAB problem 

In the non-stationary context, the rewards $\{X_s(i)\}_{s \ge 1}$ for arm $i$ are modeled by a sequence of independent
random variables from potentially different distributions (unknown to the user) which may vary across time.
For each $s > 0$, we denote by $\mu_s(i)$ the expectation of the reward $X_s(i)$ for arm $i$. Likewise, let $i_t^*$ be
the arm with highest expected reward, denoted $\mu_t(*)$, at time $t$. The regret of a policy $\pi$ is now defined as
the expected difference between the total rewards collected by the optimal policy $\pi^*$ (playing at each time
instant the arm $i_t^*$) and the total rewards collected by the policy $\pi$. Note that, in this paper, the non-stationary
regret is not defined with respect to the best arm on average, but with respect to a strategy tracking the best
arm at each step (this notion of regret is similar to the "regret against arbitrary strategies" introduced in
Section 8 of Auer et al. (2002/03) for the non-stochastic bandit problem).

In this paper, we consider abruptly changing environments: the distributions of rewards remain constant 
during periods and change at unknown time instants called breakpoints. In the following, we denote by
$\Upsilon_T$ the number of abrupt changes in the reward distributions that occur before time $T$. Another type of
non-stationary MAB, where the distribution of rewards changes continuously, is considered in Slivkins
and Upfal (2008).

Standard soft-max and UCB policies are not appropriate for abruptly changing environments: as stressed 
in Hartland et al. (2006), "empirical evidence shows that their Exploration versus Exploitation trade-off is 
not appropriate for abruptly changing environments". To address this problem, several methods have been 
proposed. 

In the family of softmax action selection policies, Auer et al. (2002/03) and Cesa-Bianchi et al. (2006,
2008) have proposed an adaptation referred to as EXP3.S of the Fixed-Share algorithm, a computationally
efficient variant of EXP3 introduced by Herbster and Warmuth (1998) (see also (Cesa-Bianchi and
Lugosi, 2006) and the references therein). Theorem 8.1 and Corollary 8.3 in Auer et al. (2002/03) state that
when EXP3.S is tuned properly (which requires in particular that $\Upsilon_T$ is known in advance), the expected
regret is upper-bounded as

$$\mathbb{E}_\pi[R_T] \le 2\sqrt{e-1}\,\sqrt{KT\,(\Upsilon_T \log(KT) + e)}.$$

Compared to the stationary case, such an upper-bound may seem disappointing: the rate $O(\sqrt{T \log T})$ is much
larger than the $O(\log T)$ achievable in absence of changes. But actually, we prove in Section 4 that no policy
can achieve an average regret smaller than $O(\sqrt{T})$ in the non-stationary case. Hence, EXP3.S matches the
best achievable rate up to a factor $\sqrt{\log T}$. Moreover, by construction this algorithm can as well be used in
an adversarial setup.

On the other hand, in the family of UCB policies, several attempts have been made; see for example
Slivkins and Upfal (2008) and Kocsis and Szepesvari (2006). In particular, Kocsis and Szepesvari (2006)
have proposed an adaptation of the UCB policies that relies on a discount factor $\gamma \in (0, 1)$. This policy
constructs a UCB $\bar{X}_t(\gamma, i) + c_t(\gamma, i)$ for the instantaneous expected reward, where the discounted empirical
average is given by

$$\bar{X}_t(\gamma, i) = \frac{1}{N_t(\gamma, i)} \sum_{s=1}^{t} \gamma^{t-s} X_s(i) \mathbb{1}_{\{I_s = i\}}, \qquad N_t(\gamma, i) = \sum_{s=1}^{t} \gamma^{t-s} \mathbb{1}_{\{I_s = i\}},$$

and the discounted padding function is defined as

$$c_t(\gamma, i) = 2B \sqrt{\frac{\xi \log n_t(\gamma)}{N_t(\gamma, i)}}, \qquad n_t(\gamma) = \sum_{i=1}^{K} N_t(\gamma, i),$$

for an appropriate parameter $\xi$. Using these notations, discounted-UCB (D-UCB) is defined in Algorithm 2.
Remark that for $\gamma = 1$, D-UCB boils down to the standard UCB-1 algorithm.

Algorithm 2 Discounted UCB

for $t$ from $1$ to $K$, play arm $I_t = t$;
for $t$ from $K + 1$ to $T$, play arm

$$I_t = \arg\max_{1 \le i \le K} \bar{X}_t(\gamma, i) + c_t(\gamma, i).$$

In order to estimate the instantaneous expected reward, the D-UCB policy averages past rewards with a
discount factor giving more weight to recent observations. We propose in this paper a more abrupt variant
of UCB where averages are computed on a fixed-size horizon. At time $t$, instead of averaging the rewards
over all past with a discount factor, sliding-window UCB relies on a local empirical average of the observed
rewards, using only the $\tau$ last plays. Specifically, this algorithm constructs a UCB $\bar{X}_t(\tau, i) + c_t(\tau, i)$ for
the instantaneous expected reward; the local empirical average is given by

$$\bar{X}_t(\tau, i) = \frac{1}{N_t(\tau, i)} \sum_{s=t-\tau+1}^{t} X_s(i) \mathbb{1}_{\{I_s = i\}}, \qquad N_t(\tau, i) = \sum_{s=t-\tau+1}^{t} \mathbb{1}_{\{I_s = i\}},$$

and the padding function is defined as

$$c_t(\tau, i) = B \sqrt{\frac{\xi \log(t \wedge \tau)}{N_t(\tau, i)}},$$

where $t \wedge \tau$ denotes the minimum of $t$ and $\tau$, and $\xi$ is some appropriate constant. The policy defined in
Algorithm 3 will be called in the sequel Sliding-Window UCB (SW-UCB).

Algorithm 3 Sliding-Window UCB

for $t$ from $1$ to $K$, play arm $I_t = t$;
for $t$ from $K + 1$ to $T$, play arm

$$I_t = \arg\max_{1 \le i \le K} \bar{X}_t(\tau, i) + c_t(\tau, i).$$

In this paper, we investigate the behaviors of the discounted-UCB and of the sliding-window-UCB in an
abruptly changing environment, and prove that they are almost rate-optimal in a minimax sense. In Section
2, we derive a finite-time upper-bound on the regret of D-UCB. In Section 3, we propose a similar analysis
for the SW-UCB policy. We establish that it achieves a slightly better regret. In Section 4, we establish
a lower-bound on the regret of any policy in an abruptly changing environment. As a by-product, we show
that any policy (like UCB-1) that achieves a logarithmic regret in the stationary case cannot reach a regret
of order smaller than $T / \log T$ in presence of breakpoints. The upper-bounds obtained in Sections 2 and 3
are based on a novel deviation inequality for self-normalized averages with random number of summands
which is stated and proved in Section A. A maximal inequality, of independent interest, is also derived in
Section B. Two simple Monte-Carlo experiments are presented to support our findings in Section 5.

2. Analysis of Discounted UCB

In this section, we analyze the behavior of D-UCB in an abruptly changing environment. Let $\Upsilon_T$ denote the
number of breakpoints before time $T$, and let $\tilde{N}_T(i)$ denote the number of times arm $i$ was played when it
was not the best arm during the $T$ first rounds:

$$\tilde{N}_T(i) = \sum_{t=1}^{T} \mathbb{1}_{\{I_t = i \ne i_t^*\}}.$$

Denote by $\Delta\mu_T(i)$ the minimum of the difference of expected reward of the best arm $\mu_t(*)$ and the expected
reward $\mu_t(i)$ of the $i$-th arm for all times $t \in \{1, \dots, T\}$ such that arm $i$ is not the leading arm ($i \ne i_t^*$),

$$\Delta\mu_T(i) = \min\{\mu_t(*) - \mu_t(i) : t \in \{1, \dots, T\},\ i \ne i_t^*\}. \tag{1}$$

We denote by $\mathbb{P}_\gamma$ and $\mathbb{E}_\gamma$ the probability distribution and expectation under policy D-UCB with discount
factor $\gamma$. The next theorem computes a bound for the expected number of times in $T$ rounds that the arm $i$
is played, when this arm is suboptimal.



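Theorem 1 Let $\xi > 1/2$ and $\gamma \in (0, 1)$. For any arm $i \in \{1, \dots, K\}$,

$$\mathbb{E}_\gamma[\tilde{N}_T(i)] \le B(\gamma)\, T(1-\gamma) \log\frac{1}{1-\gamma} + C(\gamma)\, \frac{\Upsilon_T}{1-\gamma} \log\frac{1}{1-\gamma}, \tag{2}$$

where

$$B(\gamma) = \frac{16 B^2 \xi}{\gamma^{1/(1-\gamma)} (\Delta\mu_T(i))^2}\, \frac{\lceil T(1-\gamma) \rceil}{T(1-\gamma)} + \frac{2 \left\lceil -\log(1-\gamma) / \log\big(1 + 4\sqrt{1 - 1/(2\xi)}\big) \right\rceil}{-\log(1-\gamma)\, \big(1 - \gamma^{1/(1-\gamma)}\big)}$$

and

$$C(\gamma) = \frac{\gamma - 1}{\log(1-\gamma) \log\gamma}\, \log\big((1-\gamma)\, \xi \log n_K(\gamma)\big).$$

Remark 2 When $\gamma$ goes to $1$ we have $C(\gamma) \to 1$ and

$$B(\gamma) \to \frac{16\, e\, B^2 \xi}{(\Delta\mu_T(i))^2} + \frac{2}{(1 - e^{-1}) \log\big(1 + 4\sqrt{1 - 1/(2\xi)}\big)}.$$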

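Proof The proof is adapted from the finite-time analysis of Auer et al. (2002). There are however two
main differences. First, because the expected reward changes, the discounted empirical mean $\bar{X}_t(\gamma, i)$ is
now a biased estimator of the expected reward $\mu_t(i)$. The second difference stems from the deviation
inequality itself: instead of using a Chernoff-Hoeffding bound, we use a novel tailor-made control on a
self-normalized mean of the rewards with a random number of summands. The proof is in 5 steps:

Step 1 We upper-bound the number of times the suboptimal arm $i$ is played as follows:

$$\tilde{N}_T(i) \le 1 + \sum_{t=K+1}^{T} \mathbb{1}_{\{I_t = i \ne i_t^*\}} \le 1 + \sum_{t=K+1}^{T} \mathbb{1}_{\{I_t = i,\, N_t(\gamma, i) < A(\gamma)\}} + \sum_{t=K+1}^{T} \mathbb{1}_{\{I_t = i \ne i_t^*,\, N_t(\gamma, i) \ge A(\gamma)\}},$$

where

$$A(\gamma) = \frac{16 B^2 \xi \log n_T(\gamma)}{(\Delta\mu_T(i))^2}. \tag{4}$$

Using Corollary 26 (stated and proved in the Appendix), we may upper-bound the first sum in the RHS as:

$$\sum_{t=K+1}^{T} \mathbb{1}_{\{I_t = i,\, N_t(\gamma, i) < A(\gamma)\}} \le \lceil T(1-\gamma) \rceil\, A(\gamma)\, \gamma^{-1/(1-\gamma)}.$$

In the sequel, we denote by $\mathcal{T}(\gamma)$ the set of all indices $t \in \{K+1, \dots, T\}$ such that
$\mu_s(j) = \mu_t(j)$ for all $j \in \{1, \dots, K\}$ and all $t - D(\gamma) < s \le t$, where

$$D(\gamma) = \frac{\log\big((1-\gamma)\, \xi \log n_K(\gamma)\big)}{\log \gamma}.$$

During a number of rounds (that depends on $\gamma$) following a breakpoint, the estimates of the expected rewards
can be poor. Because of this, the D-UCB policy may play constantly the suboptimal arm $i$, which leads to
the following bound:

$$\sum_{t=K+1}^{T} \mathbb{1}_{\{I_t = i \ne i_t^*,\, N_t(\gamma, i) \ge A(\gamma)\}} \le \Upsilon_T D(\gamma) + \sum_{t \in \mathcal{T}(\gamma)} \mathbb{1}_{\{I_t = i \ne i_t^*,\, N_t(\gamma, i) \ge A(\gamma)\}}.$$

Putting everything together, we obtain:

$$\tilde{N}_T(i) \le 1 + \lceil T(1-\gamma) \rceil\, A(\gamma)\, \gamma^{-1/(1-\gamma)} + \Upsilon_T D(\gamma) + \sum_{t \in \mathcal{T}(\gamma)} \mathbb{1}_{\{I_t = i \ne i_t^*,\, N_t(\gamma, i) \ge A(\gamma)\}}. \tag{5}$$

Step 2 Now, for $t \in \mathcal{T}(\gamma)$ the event $\{I_t = i \ne i_t^*,\, N_t(\gamma, i) \ge A(\gamma)\}$ may be decomposed as follows:

$$\{I_t = i \ne i_t^*,\, N_t(\gamma, i) \ge A(\gamma)\} \subset \{\bar{X}_t(\gamma, i) > \mu_t(i) + c_t(\gamma, i)\} \cup \{\bar{X}_t(\gamma, *) < \mu_t(*) - c_t(\gamma, *)\} \cup \{\mu_t(*) - \mu_t(i) < 2 c_t(\gamma, i),\, N_t(\gamma, i) \ge A(\gamma)\}. \tag{6}$$

In words, playing the suboptimal arm $i$ at time $t$ may occur in three cases: if $\mu_t(i)$ is substantially
over-estimated, if $\mu_t(*)$ is substantially under-estimated, or if $\mu_t(i)$ and $\mu_t(*)$ are close to each other. But for
the choice of $A(\gamma)$ given in Equation (4), we have, whenever $N_t(\gamma, i) \ge A(\gamma)$,

$$c_t(\gamma, i) = 2B \sqrt{\frac{\xi \log n_t(\gamma)}{N_t(\gamma, i)}} \le 2B \sqrt{\frac{\xi \log n_T(\gamma)}{A(\gamma)}} = \frac{\Delta\mu_T(i)}{2},$$

so that the event $\{\mu_t(*) - \mu_t(i) < 2 c_t(\gamma, i),\, N_t(\gamma, i) \ge A(\gamma)\}$ never occurs.

In Steps 3 and 4 we upper-bound the probability of the two first events of the RHS of (6). We show
that for $t \in \mathcal{T}(\gamma)$, that is at least $D(\gamma)$ rounds after a breakpoint, the expected rewards of all arms are well
estimated with high probability. For all $j \in \{1, \dots, K\}$, consider the following events

$$\mathcal{E}_t(\gamma, j) = \{\bar{X}_t(\gamma, j) > \mu_t(j) + c_t(\gamma, j)\}.$$

The idea is the following: we upper-bound the probability of $\mathcal{E}_t(\gamma, j)$ by separately considering the
fluctuations of $\bar{X}_t(\gamma, j)$ around $M_t(\gamma, j)/N_t(\gamma, j)$, and the 'bias' $M_t(\gamma, j)/N_t(\gamma, j) - \mu_t(j)$, where

$$M_t(\gamma, j) = \sum_{s=1}^{t} \gamma^{t-s} \mathbb{1}_{\{I_s = j\}}\, \mu_s(j).$$

Step 3 Let us first consider the bias. First note that $M_t(\gamma, j)/N_t(\gamma, j)$, as a convex combination of
elements $\mu_s(j) \in [0, B]$, belongs to the interval $[0, B]$. Hence, $|M_t(\gamma, j)/N_t(\gamma, j) - \mu_t(j)| \le B$. Second, for
$t \in \mathcal{T}(\gamma)$,

$$|M_t(\gamma, j) - \mu_t(j) N_t(\gamma, j)| = \Big| \sum_{s=1}^{t - D(\gamma)} \gamma^{t-s} (\mu_s(j) - \mu_t(j)) \mathbb{1}_{\{I_s = j\}} \Big| \le \sum_{s=1}^{t - D(\gamma)} \gamma^{t-s} |\mu_s(j) - \mu_t(j)| \mathbb{1}_{\{I_s = j\}} \le B \sum_{s=1}^{t - D(\gamma)} \gamma^{t-s} \mathbb{1}_{\{I_s = j\}} = B \gamma^{D(\gamma)} N_{t - D(\gamma)}(\gamma, j).$$

As $N_{t - D(\gamma)}(\gamma, j) \le (1-\gamma)^{-1}$, we get $|M_t(\gamma, j)/N_t(\gamma, j) - \mu_t(j)| \le B \gamma^{D(\gamma)} \big((1-\gamma) N_t(\gamma, j)\big)^{-1}$. Altogether,

$$\Big| \frac{M_t(\gamma, j)}{N_t(\gamma, j)} - \mu_t(j) \Big| \le B \Big( 1 \wedge \frac{\gamma^{D(\gamma)}}{(1-\gamma) N_t(\gamma, j)} \Big).$$

Hence, using the elementary inequality $1 \wedge x \le \sqrt{x}$ and the definition of $D(\gamma)$, we obtain for $t \in \mathcal{T}(\gamma)$:

$$\Big| \frac{M_t(\gamma, j)}{N_t(\gamma, j)} - \mu_t(j) \Big| \le B \sqrt{\frac{\gamma^{D(\gamma)}}{(1-\gamma) N_t(\gamma, j)}} = B \sqrt{\frac{\xi \log n_K(\gamma)}{N_t(\gamma, j)}} \le \frac{1}{2} c_t(\gamma, j).$$

In words: $D(\gamma)$ rounds after a breakpoint, the 'bias' is smaller than half of the padding function. The
other half of the padding function is used to control the fluctuations. In fact, for $t \in \mathcal{T}(\gamma)$:

$$\mathbb{P}_\gamma(\mathcal{E}_t(\gamma, j)) \le \mathbb{P}_\gamma\Big( \bar{X}_t(\gamma, j) > \mu_t(j) + B \sqrt{\frac{\xi \log n_t(\gamma)}{N_t(\gamma, j)}} + \Big| \frac{M_t(\gamma, j)}{N_t(\gamma, j)} - \mu_t(j) \Big| \Big) \le \mathbb{P}_\gamma\Big( \bar{X}_t(\gamma, j) - \frac{M_t(\gamma, j)}{N_t(\gamma, j)} > B \sqrt{\frac{\xi \log n_t(\gamma)}{N_t(\gamma, j)}} \Big).$$

Step 4 Denote the discounted total reward obtained with arm $j$ by

$$S_t(\gamma, j) = \sum_{s=1}^{t} \gamma^{t-s} \mathbb{1}_{\{I_s = j\}} X_s(j) = N_t(\gamma, j) \bar{X}_t(\gamma, j).$$

Using Theorem 18 and the fact that $N_t(\gamma, j) \ge N_t(\gamma^2, j)$, the previous inequality rewrites:

$$\mathbb{P}_\gamma(\mathcal{E}_t(\gamma, j)) \le \mathbb{P}_\gamma\Big( \frac{S_t(\gamma, j) - M_t(\gamma, j)}{\sqrt{N_t(\gamma^2, j)}} > B \sqrt{\frac{\xi N_t(\gamma, j) \log n_t(\gamma)}{N_t(\gamma^2, j)}} \Big) \le \mathbb{P}_\gamma\Big( \frac{S_t(\gamma, j) - M_t(\gamma, j)}{\sqrt{N_t(\gamma^2, j)}} > B \sqrt{\xi \log n_t(\gamma)} \Big)$$

$$\le \Big\lceil \frac{\log n_t(\gamma)}{\log(1+\eta)} \Big\rceil \exp\Big( -2 \xi \log n_t(\gamma) \Big( 1 - \frac{\eta^2}{16} \Big) \Big) = \Big\lceil \frac{\log n_t(\gamma)}{\log(1+\eta)} \Big\rceil n_t(\gamma)^{-2\xi(1 - \eta^2/16)}.$$

Step 5 Hence, we finally obtain from Equation (5)

$$\mathbb{E}_\gamma[\tilde{N}_T(i)] \le 1 + \lceil T(1-\gamma) \rceil\, A(\gamma)\, \gamma^{-1/(1-\gamma)} + D(\gamma) \Upsilon_T + 2 \sum_{t \in \mathcal{T}(\gamma)} \Big\lceil \frac{\log n_t(\gamma)}{\log(1+\eta)} \Big\rceil n_t(\gamma)^{-2\xi(1 - \eta^2/16)}.$$

When $\Upsilon_T \ne 0$, $\gamma$ is taken strictly smaller than $1$ (see Remark 3). As $\xi > 1/2$, we take $\eta = 4\sqrt{1 - 1/(2\xi)}$ so
that $2\xi(1 - \eta^2/16) = 1$. For that choice, with $\tau = (1-\gamma)^{-1}$,

$$\sum_{t \in \mathcal{T}(\gamma)} \Big\lceil \frac{\log n_t(\gamma)}{\log(1+\eta)} \Big\rceil n_t(\gamma)^{-1} \le \tau - K + \sum_{t=\tau}^{T} \Big\lceil \frac{\log n_\tau(\gamma)}{\log(1+\eta)} \Big\rceil n_\tau(\gamma)^{-1} \le \tau - K + \Big\lceil \frac{\log \frac{1}{1-\gamma}}{\log(1+\eta)} \Big\rceil \frac{T(1-\gamma)}{1 - \gamma^{1/(1-\gamma)}},$$

we obtain the statement of the Theorem.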
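
The D-UCB policy analyzed above only needs the discounted counts and discounted reward sums, which can be updated recursively in $O(K)$ operations per round. The following Python sketch illustrates this under stated assumptions: the class name `DiscountedUCB`, its method names and the 0-based arm numbering are hypothetical, and the guard against a discounted count that has underflowed to zero is an addition that does not appear in the paper.

```python
import math


class DiscountedUCB:
    """Minimal D-UCB sketch (hypothetical interface, arms numbered from 0)."""

    def __init__(self, n_arms, gamma, bound=1.0, xi=0.5):
        self.n_arms = n_arms
        self.gamma = gamma
        self.bound = bound
        self.xi = xi
        self.disc_counts = [0.0] * n_arms  # N_t(gamma, i)
        self.disc_sums = [0.0] * n_arms    # N_t(gamma, i) * Xbar_t(gamma, i)
        self.t = 0                         # number of completed rounds

    def select_arm(self):
        if self.t < self.n_arms:
            return self.t                  # initialization: play each arm once
        n_total = sum(self.disc_counts)    # n_t(gamma)

        def index(i):
            if self.disc_counts[i] <= 0.0:
                return math.inf            # assumption: replay an arm whose count underflowed
            mean = self.disc_sums[i] / self.disc_counts[i]
            padding = 2.0 * self.bound * math.sqrt(
                self.xi * math.log(n_total) / self.disc_counts[i])
            return mean + padding

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        # Discount every arm, then add the new observation to the played arm.
        for i in range(self.n_arms):
            self.disc_counts[i] *= self.gamma
            self.disc_sums[i] *= self.gamma
        self.disc_counts[arm] += 1.0
        self.disc_sums[arm] += reward
        self.t += 1
```

A typical loop calls `select_arm()`, observes the reward of the chosen arm, then calls `update(arm, reward)`.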



Remark 3 If the horizon $T$ and the growth rate of the number of breakpoints $\Upsilon_T$ are known in advance, the
discount factor $\gamma$ can be chosen so as to minimize the RHS in Equation (2). Taking $\gamma = 1 - (4B)^{-1} \sqrt{\Upsilon_T / T}$
yields:

$$\mathbb{E}_\gamma[\tilde{N}_T(i)] = O\big( \sqrt{T \Upsilon_T} \log T \big).$$

Assuming that $\Upsilon_T = O(T^\beta)$ for some $\beta \in [0, 1)$, the regret is upper-bounded as $O\big(T^{(1+\beta)/2} \log T\big)$. In
particular, if $\beta = 0$, the number of breakpoints $\Upsilon_T$ is upper-bounded by $\Upsilon$ independently of $T$; taking
$\gamma = 1 - (4B)^{-1} \sqrt{\Upsilon / T}$, the regret is bounded by $O\big(\sqrt{\Upsilon T} \log T\big)$. Thus, D-UCB matches the lower-bound
of Theorem 13 up to a factor $\log T$.

Remark 4 On the other hand, if the breakpoints have a positive density over time (say, if $\Upsilon_T \le rT$ for
a small positive constant $r$), then $\gamma$ has to remain lower-bounded independently of $T$; Theorem 1 gives
a linear, non-trivial bound on the regret and makes it possible to calibrate the discount factor $\gamma$ as a function of
the density of the breakpoints: taking $\gamma = 1 - \sqrt{r}/(4B)$ we get an upper-bound with a dominant term in
$O\big(-T \sqrt{r} \log r\big)$.

Remark 5 Theorem 22 shows that for $\xi > 1/2$ and $t \in \mathcal{T}(\gamma)$, with high probability $\bar{X}_t(\gamma, i)$ is actually
never larger than $\mu_t(i) + c_t(\gamma, i)$.

Remark 6 If the growth rate of $\Upsilon_T$ is known in advance, but not the horizon $T$, then we can use the
"doubling trick" to set the value of $\gamma$. Namely, for $t$ and $k$ such that $2^k \le t < 2^{k+1}$, take
$\gamma = 1 - (4B)^{-1} (2^k)^{(\beta - 1)/2}$.

3. Sliding window UCB

In this section, we analyze the performance of SW-UCB in an abruptly changing environment. We denote
by $\mathbb{P}_\tau$ and $\mathbb{E}_\tau$ the probability distribution and expectation under policy SW-UCB with window size $\tau$.



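Theorem 7 Let $\xi > 1/2$. For any integer $\tau$ and any arm $i \in \{1, \dots, K\}$,

$$\mathbb{E}_\tau[\tilde{N}_T(i)] \le C(\tau) \frac{T \log \tau}{\tau} + \tau \Upsilon_T + \log^2(\tau), \tag{7}$$

where

$$C(\tau) = \frac{4 B^2 \xi}{(\Delta\mu_T(i))^2}\, \frac{\lceil T/\tau \rceil}{T/\tau} + \frac{2}{\log \tau} \Big\lceil \frac{\log(\tau)}{\log\big(1 + 4\sqrt{1 - (2\xi)^{-1}}\big)} \Big\rceil.$$

Remark 8 As $\tau$ goes to infinity,

$$C(\tau) \to \frac{4 B^2 \xi}{(\Delta\mu_T(i))^2} + \frac{2}{\log\big(1 + 4\sqrt{1 - (2\xi)^{-1}}\big)}.$$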



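Proof We follow the lines of the proof of Theorem 1. The main difference is that for $t \in \mathcal{T}(\tau)$, defined
here as the set of all indices $t \in \{K+1, \dots, T\}$ such that $\mu_s(j) = \mu_t(j)$ for all $j \in \{1, \dots, K\}$ and all
$t - \tau < s \le t$, the bias exactly vanishes; consequently, Step 3 can be bypassed.

Step 1 Let $A(\tau) = 4 B^2 \xi \log \tau\, (\Delta\mu_T(i))^{-2}$; using Lemma 25, we have:

$$\tilde{N}_T(i) \le 1 + \sum_{t=K+1}^{T} \mathbb{1}_{\{I_t = i \ne i_t^*\}}$$

$$\le 1 + \sum_{t=1}^{T} \mathbb{1}_{\{I_t = i,\, N_t(\tau, i) < A(\tau)\}} + \sum_{t=K+1}^{T} \mathbb{1}_{\{I_t = i \ne i_t^*,\, N_t(\tau, i) \ge A(\tau)\}}$$

$$\le 1 + \lceil T/\tau \rceil A(\tau) + \sum_{t=K+1}^{T} \mathbb{1}_{\{I_t = i \ne i_t^*,\, N_t(\tau, i) \ge A(\tau)\}}$$

$$\le 1 + \lceil T/\tau \rceil A(\tau) + \Upsilon_T \tau + \sum_{t \in \mathcal{T}(\tau)} \mathbb{1}_{\{I_t = i \ne i_t^*,\, N_t(\tau, i) \ge A(\tau)\}}. \tag{8}$$

Step 2 For $t \in \mathcal{T}(\tau)$ we have

$$\{I_t = i,\, N_t(\tau, i) \ge A(\tau)\} \subset \{\bar{X}_t(\tau, i) > \mu_t(i) + c_t(\tau, i)\} \cup \{\bar{X}_t(\tau, *) < \mu_t(*) - c_t(\tau, *)\} \cup \{\mu_t(*) - \mu_t(i) < 2 c_t(\tau, i),\, N_t(\tau, i) \ge A(\tau)\}. \tag{9}$$

On the event $\{N_t(\tau, i) \ge A(\tau)\}$, we have

$$c_t(\tau, i) = B \sqrt{\frac{\xi \log(t \wedge \tau)}{N_t(\tau, i)}} \le B \sqrt{\frac{\xi \log(\tau)}{A(\tau)}} = B \sqrt{\frac{\xi \log(\tau) (\Delta\mu_T(i))^2}{4 B^2 \xi \log \tau}} = \frac{\Delta\mu_T(i)}{2},$$

so that the event $\{\mu_t(*) - \mu_t(i) < 2 c_t(\tau, i),\, N_t(\tau, i) \ge A(\tau)\}$ has $\mathbb{P}_\tau$-probability $0$.

Steps 3-4 Now, for $t \in \mathcal{T}(\tau)$ and for all $j \in \{1, \dots, K\}$, Corollary 21 applies and yields:

$$\mathbb{P}_\tau\big( \bar{X}_t(\tau, j) > \mu_t(j) + c_t(\tau, j) \big) \le \mathbb{P}_\tau\Big( \bar{X}_t(\tau, j) > \mu_t(j) + B \sqrt{\frac{\xi \log(t \wedge \tau)}{N_t(\tau, j)}} \Big) \le \Big\lceil \frac{\log(t \wedge \tau)}{\log(1+\eta)} \Big\rceil \exp\Big( -2\xi \log(t \wedge \tau) \Big( 1 - \frac{\eta^2}{16} \Big) \Big) = \Big\lceil \frac{\log(t \wedge \tau)}{\log(1+\eta)} \Big\rceil (t \wedge \tau)^{-2\xi(1 - \eta^2/16)}, \tag{10}$$

and similarly

$$\mathbb{P}_\tau\big( \bar{X}_t(\tau, j) < \mu_t(j) - c_t(\tau, j) \big) \le \Big\lceil \frac{\log(t \wedge \tau)}{\log(1+\eta)} \Big\rceil (t \wedge \tau)^{-2\xi(1 - \eta^2/16)}. \tag{11}$$

Step 5 In the following we take $\eta = 4\sqrt{1 - (2\xi)^{-1}}$ so that we have $2\xi(1 - \eta^2/16) = 1$. Thus, using
Equations (9), (10) and (11), Inequality (8) yields

$$\mathbb{E}_\tau[\tilde{N}_T(i)] \le 1 + \lceil T/\tau \rceil A(\tau) + \tau \Upsilon_T + 2 \sum_{t=K+1}^{T} \Big\lceil \frac{\log(t \wedge \tau)}{\log(1+\eta)} \Big\rceil \frac{1}{t \wedge \tau}.$$

The result follows, noting that

$$\sum_{t=K+1}^{T} \Big\lceil \frac{\log(t \wedge \tau)}{\log(1+\eta)} \Big\rceil \frac{1}{t \wedge \tau} \le \sum_{t=2}^{\tau} \Big\lceil \frac{\log t}{\log(1+\eta)} \Big\rceil \frac{1}{t} + \sum_{t=1}^{T} \Big\lceil \frac{\log \tau}{\log(1+\eta)} \Big\rceil \frac{1}{\tau} \le \frac{1}{2} \log^2(\tau) + \frac{T}{\tau} \Big\lceil \frac{\log \tau}{\log(1+\eta)} \Big\rceil.$$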



Remark 9 If the horizon $T$ and the growth rate of the number of breakpoints $\Upsilon_T$ are known in advance, the
window size $\tau$ can be chosen so as to minimize the RHS in Equation (7). Taking $\tau = 2B \sqrt{T \log(T) / \Upsilon_T}$
yields

$$\mathbb{E}_\tau[\tilde{N}_T(i)] = O\big( \sqrt{\Upsilon_T T \log T} \big).$$

Assuming that $\Upsilon_T = O(T^\beta)$ for some $\beta \in [0, 1)$, the average regret is upper-bounded as $O\big(T^{(1+\beta)/2} \sqrt{\log T}\big)$.
In particular, if $\beta = 0$, the number of breakpoints $\Upsilon_T$ is upper-bounded by $\Upsilon$ independently of $T$, then with
$\tau = 2B \sqrt{T \log(T) / \Upsilon}$ the upper-bound is $O\big(\sqrt{\Upsilon T \log T}\big)$. Thus, SW-UCB matches the lower-bound of
Theorem 13 up to a factor $\sqrt{\log T}$, slightly better than the D-UCB.
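
The tuned parameters of Remark 3 (discount factor) and Remark 9 (window size) are simple closed-form expressions; a small Python helper makes the orders of magnitude concrete. The function names below are hypothetical, the window is rounded up to an integer as an assumption of this sketch, and both functions assume at least one breakpoint and a horizon larger than one.

```python
import math


def tuned_discount(horizon, n_breakpoints, bound=1.0):
    """gamma = 1 - (4B)^{-1} sqrt(Upsilon_T / T), the choice made in Remark 3."""
    return 1.0 - math.sqrt(n_breakpoints / horizon) / (4.0 * bound)


def tuned_window(horizon, n_breakpoints, bound=1.0):
    """tau = 2B sqrt(T log(T) / Upsilon_T), the choice made in Remark 9.

    Rounded up so that it can be used as an integer window size
    (the rounding is an assumption of this sketch).
    """
    tau = 2.0 * bound * math.sqrt(horizon * math.log(horizon) / n_breakpoints)
    return math.ceil(tau)
```

For instance, with rewards bounded by $B = 1$, a horizon of $10^4$ rounds and a single breakpoint, these formulas give a discount factor of $0.9975$ and a window of $607$ rounds.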



Remark 10 On the other hand, if the breakpoints have a positive density over time, then $\tau$ has to remain
lower-bounded independently of $T$. For instance, if $\Upsilon_T \le rT$ for some (small) positive rate $r$, and for the
choice $\tau = \sqrt{-\log r / r}$, Theorem 7 gives

$$\mathbb{E}_\tau[\tilde{N}_T(i)] = O\big( T \sqrt{-r \log(r)} \big).$$

Remark 11 If there is no breakpoint ($\Upsilon_T = 0$), the best choice is obviously to take the window as large as
possible, that is $\tau = T$. Then the procedure is exactly standard UCB. A slight modification of the preceding
proof for $\xi = \frac{1}{2} + \epsilon$ with arbitrary small $\epsilon$ yields

$$\mathbb{E}_{\mathrm{UCB}}[\tilde{N}_T(i)] \le \frac{2 B^2}{(\Delta\mu(i))^2} \log(T)\, (1 + o(1)).$$

We recover the same kind of bounds that are usually obtained in the analysis of UCB, see for instance Auer
et al. (2002), with a better constant.

Remark 12 The computational complexity of SW-UCB is, as for D-UCB, linear in time and does not involve
$T$. However, SW-UCB requires storing the last $\tau$ actions and rewards at each time $t$ in order to efficiently
update $N_t(\tau, i)$ and $\bar{X}_t(\tau, i)$.
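
The storage requirement mentioned in Remark 12 can be met with a double-ended queue holding the last $\tau$ (arm, reward) pairs, so that each round costs $O(K)$ for the argmax plus $O(1)$ for the update. The Python sketch below is only an illustration under stated assumptions: the class name `SlidingWindowUCB`, its method names, the 0-based arm numbering and the rule that an arm absent from the window receives an infinite index are hypothetical choices, not taken from the paper.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Minimal SW-UCB sketch (hypothetical interface, arms numbered from 0)."""

    def __init__(self, n_arms, window, bound=1.0, xi=0.5):
        self.n_arms = n_arms
        self.window = window           # tau, assumed to be an integer >= 1
        self.bound = bound
        self.xi = xi
        self.history = deque()         # last `window` (arm, reward) pairs
        self.counts = [0] * n_arms     # N_t(tau, i)
        self.sums = [0.0] * n_arms     # N_t(tau, i) * Xbar_t(tau, i)
        self.t = 0                     # number of completed rounds

    def select_arm(self):
        if self.t < self.n_arms:
            return self.t              # initialization: play each arm once
        log_term = math.log(min(self.t, self.window))

        def index(i):
            if self.counts[i] == 0:
                return math.inf        # assumption: replay an arm that left the window
            mean = self.sums[i] / self.counts[i]
            padding = self.bound * math.sqrt(self.xi * log_term / self.counts[i])
            return mean + padding

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.history.append((arm, reward))
        self.counts[arm] += 1
        self.sums[arm] += reward
        if len(self.history) > self.window:
            # Forget the observation that just slid out of the window.
            old_arm, old_reward = self.history.popleft()
            self.counts[old_arm] -= 1
            self.sums[old_arm] -= old_reward
        self.t += 1
```

Subtracting old rewards from a running sum can accumulate floating-point error over very long runs; recomputing the sums from `history` from time to time is one simple remedy.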



4. A lower-bound on the regret in abruptly changing environment 

In this section, we consider a particular non-stationary bandit problem where the distributions of rewards on
each arm are piecewise constant and have two breakpoints. Given any policy $\pi$, we derive a lower-bound on
the number of times a sub-optimal arm is played (and thus, on the regret) in at least one such game. Quite
intuitively, the less explorative a policy is, the longer it may keep a suboptimal policy after a breakpoint.
Theorem 13 gives a precise content to this statement.

As in the previous section, $K$ denotes the number of arms, and the rewards are assumed to be bounded in
$[0, B]$. Consider any deterministic policy $\pi$ of choosing the arms $I_1, \dots, I_T$ played at each time depending
on the past rewards

$$G_t \triangleq X_t(I_t),$$

and recall that $I_t$ is measurable with respect to the sigma-field $\sigma(G_1, \dots, G_t)$ of the past observed rewards.
Denote by $N_{s:t}(i)$ the number of times arm $i$ is played between times $s$ and $t$

$$N_{s:t}(i) = \sum_{u=s}^{t} \mathbb{1}_{\{I_u = i\}},$$

and $N_T(i) = N_{1:T}(i)$. For $1 \le i \le K$, let $P_i$ be the probability distribution of the outcomes of arm $i$, and
let $\mu(i)$ denote its expectation. Assume that $\mu(1) > \mu(i)$ for all $2 \le i \le K$. Denote by $\mathbb{P}_\pi$ the distribution
of rewards under policy $\pi$, that is:

$$d\mathbb{P}_\pi(g_{1:T} \mid I_{1:T}) = \prod_{t=1}^{T} dP_{I_t}(g_t).$$

For any random variable $W$ measurable with respect to $\sigma(G_1, \dots, G_T)$, denote by $\mathbb{E}_\pi[W]$ its expectation
under distribution $\mathbb{P}_\pi$.

In the sequel, we divide the period $\{1, \dots, T\}$ into epochs of size $\tau \in \{1, \dots, T\}$, and we modify the
distribution of the rewards so that on one of those periods, arm $K$ becomes the one with highest expected
reward. Specifically: let $Q$ be a distribution of rewards with expectation $\nu > \mu(1)$, let $\delta = \nu - \mu(1)$ and
let $\alpha = D(P_K; Q)$ be the Kullback-Leibler divergence between $P_K$ and $Q$. For all $1 \le j \le M = \lfloor T/\tau \rfloor$, we
consider the modification $\mathbb{P}_\pi^j$ of $\mathbb{P}_\pi$ such that on the $j$-th period of size $\tau$, the distribution of rewards of the
$K$-th arm is changed to $Q$. That is, for every sequence of rewards $g_{1:T}$,

$$\frac{d\mathbb{P}_\pi^j}{d\mathbb{P}_\pi}(g_{1:T} \mid I_{1:T}) = \prod_{t = 1 + (j-1)\tau,\ I_t = K}^{j\tau} \frac{dQ}{dP_K}(g_t).$$

Besides, let

$$N^j(i) = N_{1 + (j-1)\tau : j\tau}(i)$$

be the number of times arm $i$ is played in the $j$-th period. For any random variable $W$ in $\sigma(G_1, \dots, G_T)$,
denote by $\mathbb{E}_\pi^j[W]$ its expectation under distribution $\mathbb{P}_\pi^j$. Now, denote by $\mathbb{P}_\pi^*$ the distribution of rewards when
$j$ is chosen uniformly at random in the set $\{1, \dots, M\}$ - in other words, $\mathbb{P}_\pi^*$ is the (uniform) mixture of the
$(\mathbb{P}_\pi^j)_{1 \le j \le M}$, and denote by $\mathbb{E}_\pi^*[\cdot]$ the expectation under $\mathbb{P}_\pi^*$:

$$\mathbb{E}_\pi^*[W] = \frac{1}{M} \sum_{j=1}^{M} \mathbb{E}_\pi^j[W].$$

In the following, we lower-bound the expected regret of any policy $\pi$ under $\mathbb{P}_\pi^*$ in terms of its regret under $\mathbb{P}_\pi$.
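Theorem 13 For any policy $\pi$ and any horizon $T$ such that $64/(9\alpha) \le \mathbb{E}_\pi[N_T(K)] \le T/(4\alpha)$,

$$\mathbb{E}_\pi^*[R_T] \ge C(\mu) \frac{T}{\mathbb{E}_\pi[R_T]},$$

where

$$C(\mu) = \frac{32\, \delta\, (\mu(1) - \mu(K))}{27 \alpha}.$$

Proof The main ingredients of this reasoning are inspired by the proof of Theorem 5.1 in Auer et al.
(2002/03), see also Kulkarni and Lugosi (2000). First, note that the Kullback-Leibler divergence $D(\mathbb{P}_\pi; \mathbb{P}_\pi^j)$
is:

$$D(\mathbb{P}_\pi; \mathbb{P}_\pi^j) = \sum_{t=1}^{T} D\big( \mathbb{P}_\pi(G_t \mid G_{1:t-1});\, \mathbb{P}_\pi^j(G_t \mid G_{1:t-1}) \big) = \sum_{t = 1 + (j-1)\tau}^{j\tau} \mathbb{P}_\pi(I_t = K)\, D(P_K; Q) = \alpha\, \mathbb{E}_\pi[N^j(K)].$$

Hence, by Lemma A.1 in Auer et al. (2002/03),

$$\mathbb{E}_\pi^j[N^j(K)] \le \mathbb{E}_\pi[N^j(K)] + \frac{\tau}{2} \sqrt{D(\mathbb{P}_\pi; \mathbb{P}_\pi^j)} = \mathbb{E}_\pi[N^j(K)] + \frac{\tau}{2} \sqrt{\alpha\, \mathbb{E}_\pi[N^j(K)]}.$$

Consequently, since $\sum_{j=1}^{M} N^j(K) \le N_T(K)$,

$$\sum_{j=1}^{M} \mathbb{E}_\pi^j[N^j(K)] - \mathbb{E}_\pi[N_T(K)] \le \frac{\tau}{2} \sum_{j=1}^{M} \sqrt{\alpha\, \mathbb{E}_\pi[N^j(K)]} \le \frac{\tau}{2} \sqrt{\alpha M\, \mathbb{E}_\pi[N_T(K)]}.$$

Thus, there exists $1 \le j \le M$ such that

$$\mathbb{E}_\pi^j[N^j(K)] \le \frac{1}{M} \mathbb{E}_\pi[N_T(K)] + \frac{\tau}{2M} \sqrt{\alpha M\, \mathbb{E}_\pi[N_T(K)]} \le \frac{\tau}{T - \tau} \mathbb{E}_\pi[N_T(K)] + \frac{1}{2} \sqrt{\frac{\alpha \tau^3}{T - \tau} \mathbb{E}_\pi[N_T(K)]}.$$

Now, the expectation under $\mathbb{P}_\pi^*$ of the regret $R_T$ is lower-bounded by:

$$\mathbb{E}_\pi^*[R_T] \ge \delta \big( \tau - \mathbb{E}_\pi^*[N^j(K)] \big) \ge \delta \Big( \tau - \frac{\tau}{T - \tau} \mathbb{E}_\pi[N_T(K)] - \frac{1}{2} \sqrt{\frac{\alpha \tau^3}{T - \tau} \mathbb{E}_\pi[N_T(K)]} \Big).$$

Maximizing the right hand side of the previous inequality by choosing $\tau = 16T / (9\alpha\, \mathbb{E}_\pi[N_T(K)])$ or
equivalently $M = 9\alpha\, \mathbb{E}_\pi[N_T(K)] / 16$ leads to the lower-bound:

$$\mathbb{E}_\pi^*[R_T] \ge \frac{32 \delta}{27 \alpha} \Big( 1 - \frac{\alpha\, \mathbb{E}_\pi[N_T(K)]}{T} \Big) \Big( 1 - \frac{16}{9 \alpha\, \mathbb{E}_\pi[N_T(K)]} \Big) \frac{T}{\mathbb{E}_\pi[N_T(K)]}.$$

To conclude, simply note that $\mathbb{E}_\pi[N_T(K)] \le \mathbb{E}_\pi[R_T] / (\mu(1) - \mu(K))$. We obtain:

$$\mathbb{E}_\pi^*[R_T] \ge \frac{32 \delta (\mu(1) - \mu(K))}{27 \alpha} \Big( 1 - \frac{\alpha\, \mathbb{E}_\pi[N_T(K)]}{T} \Big) \Big( 1 - \frac{16}{9 \alpha\, \mathbb{E}_\pi[N_T(K)]} \Big) \frac{T}{\mathbb{E}_\pi[R_T]},$$

which directly leads to the statement of the Theorem.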



The following corollary states that no policy can have a non-stationary regret of order smaller than $\sqrt{T}$.
It appears here as a consequence of Theorem 13, although it can also be proved directly.

Corollary 14 For any policy $\pi$ and any positive horizon $T$,

$$\max\{\mathbb{E}_\pi(R_T),\, \mathbb{E}_\pi^*(R_T)\} \ge \sqrt{C(\mu) T}.$$

Proof If $\mathbb{E}_\pi[N_T(K)] \le 16/(9\alpha)$, or if $\mathbb{E}_\pi[N_T(K)] \ge T/\alpha$, the result is obvious. Otherwise, Theorem 13
implies that:

$$\max\{\mathbb{E}_\pi(R_T),\, \mathbb{E}_\pi^*(R_T)\} \ge \max\Big\{ \mathbb{E}_\pi(R_T),\, C(\mu) \frac{T}{\mathbb{E}_\pi(R_T)} \Big\} \ge \sqrt{C(\mu) T}.$$
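
The last inequality in this proof is an instance of the elementary bound $\max\{a, b\} \ge \sqrt{ab}$ for positive $a$ and $b$. Spelled out, with the shorthand $a$ introduced here only for brevity:

```latex
\max\Big\{ a,\ \frac{C(\mu)\,T}{a} \Big\}
  \;\ge\; \sqrt{a \cdot \frac{C(\mu)\,T}{a}}
  \;=\; \sqrt{C(\mu)\,T},
\qquad a = \mathbb{E}_\pi(R_T) > 0 .
```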



Remark 15 To keep simple notations, Theorem 13 is stated and proved here for deterministic policies. It is
easily verified that the same result also holds for randomized strategies (such as EXP3-P, see Auer et al.
(2002/03)).

Remark 16 In words, Theorem 13 states that for any policy not playing each arm often enough, there is
necessarily a time where a breakpoint is not seen after a long period. For instance, as standard UCB satisfies
$\mathbb{E}_\pi[N(K)] = \Theta(\log T)$, then

$$\mathbb{E}_\pi^*[R_T] \ge c\, \frac{T}{\log T}$$

for some positive $c$ depending on the reward distribution.

Remark 17 This result is to be compared with standard minimax lower-bounds on the regret. On one hand,
a fixed-game lower-bound in $O(\log T)$ was proved in Lai and Robbins (1985) for the stationary case, when
the distributions of rewards are fixed and $T$ is allowed to go to infinity. On the other hand, a finite-time
minimax lower-bound for individual sequences in $O(\sqrt{T})$ is proved in Auer et al. (2002/03). In this bound,
for each horizon $T$ the worst case among all possible reward distributions is considered, which explains
the discrepancy. This result is obtained by letting the distance between distributions of rewards tend to $0$
(typically, as $1/\sqrt{T}$). In Theorem 13, no assumption is made on the distributions of rewards $P_i$ and $Q$; their
distance actually remains lower-bounded independently of $T$. In fact, in the case considered here minimax
regret and fixed-game minimal regret appear to have the same order of magnitude.

5. Simulations 

We consider here two settings. In the first example, there are $K = 3$ arms and the time horizon is set to
$T = 10^4$. The agent's goal is to minimize the expected regret. The rewards of arm $i \in \{1, \dots, K\}$ at time $t$
are independent Bernoulli random variables with success probability $p_t(i)$, with $p_t(1) = 0.5$, $p_t(2) = 0.3$
and, for $t \in \{1, \dots, T\}$:

$$p_t(3) = \begin{cases} 0.4 & \text{for } t < 3000 \text{ or } t \ge 5000, \\ 0.9 & \text{for } 3000 \le t < 5000. \end{cases}$$
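
As a minimal sketch of this first setting, the following Python snippet encodes the three success probabilities and the arm that is optimal at each time step. The function names `success_prob`, `optimal_arm` and `draw_reward` are illustrative and not taken from the paper, and the treatment of the boundary steps (a breakpoint belongs to the interval that starts at it) is an assumption.

```python
import random

def success_prob(arm, t):
    """Success probability p_t(arm) for arms 1..3 in the first example."""
    if arm == 1:
        return 0.5
    if arm == 2:
        return 0.3
    if arm == 3:
        # Assumed boundary convention: arm 3 is at 0.9 on 3000 <= t < 5000.
        return 0.9 if 3000 <= t < 5000 else 0.4
    raise ValueError("arm must be 1, 2 or 3")

def optimal_arm(t):
    """Arm with the largest success probability at time t."""
    return max((1, 2, 3), key=lambda arm: success_prob(arm, t))

def draw_reward(arm, t, rng):
    """Bernoulli reward of the chosen arm at time t."""
    return 1 if rng.random() < success_prob(arm, t) else 0
```

Under these assumptions, `optimal_arm` returns arm 1 outside the interval between the two breakpoints and arm 3 inside it.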

As one may notice, the optimal policy for this bandit task is to select arm 1 before the first breakpoint
($t = 3000$) and after the second breakpoint ($t = 5000$). In the left panel of Figure 1, we represent the
evolution of two criteria as a function of $t$: the number of times arm 1 has been played (middle plot), and the cumulative
regret (bottom plot). These two measures are obviously related, but they are not completely equivalent as
sub-optimal arms can yield relatively high rewards. We compare the UCB-1 algorithm with $\xi = 1/2$, the
EXP3.S algorithm described in Auer et al. (2002/03) with the tuned parameters given in Corollary 8.3 (with
the notations of this paper, $\alpha = T^{-1}$ and $\gamma = \sqrt{K(\Upsilon_T \log(KT) + e)/[(e - 1)T]}$ with $\Upsilon_T = 2$), the D-UCB
algorithm with $\xi = 1/2$ and $\gamma = 1 - 1/(4\sqrt{T})$, and the SW-UCB algorithm with $\xi = 1/2$ and $\tau = 2\sqrt{T \log T}$. The
parameters are tuned to obtain roughly optimal performance for the chosen horizon $T$ and the number of
breakpoints.
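
For concreteness, the tuning above can be evaluated numerically. The sketch below assumes that the D-UCB discount factor reads $\gamma = 1 - 1/(4\sqrt{T})$ and the SW-UCB window reads $\tau = 2\sqrt{T \log T}$, and that the horizon is $T = 10^4$; rounding the window to an integer is a choice made here for illustration. The helper name `tuned_parameters` is hypothetical.

```python
import math

def tuned_parameters(T):
    """Discount factor and window size under the assumed tuning formulas."""
    gamma = 1.0 - 1.0 / (4.0 * math.sqrt(T))             # D-UCB discount
    tau = int(round(2.0 * math.sqrt(T * math.log(T))))   # SW-UCB window
    return gamma, tau

gamma, tau = tuned_parameters(10_000)
print(gamma, tau)
```

With these formulas the discount factor is very close to 1 and the window covers a few hundred time steps, which is short compared with the 2000 steps separating the two breakpoints.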

As can be seen in Figure 1 (and as consistently observed over the simulations), D-UCB performs almost 
as well as SW-UCB. Both of them waste significantly less time than EXP3.S and UCB-1 to detect the 
breakpoints, and quickly concentrate their pulls on the optimal arm. Observe that policy UCB-1, initially 
the best, reacts very fast to the first breakpoint (t = 3000), as the confidence interval for arm 3 at this step is 
very loose. On the contrary, it takes a very long time after the second breakpoint (t = 5000) for UCB-1 to 
play arm 1 again. 
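
The behaviour described above can be explored with a small simulation. The sketch below is not the authors' code: it assumes the sliding-window index has the form "windowed empirical mean plus $B\sqrt{\xi \log(\min(t, \tau))/N_t(\tau, i)}$", uses 0-based arm indices, and measures the pseudo-regret (the gap between the best success probability and that of the arm played). The names `prob` and `sw_ucb` are illustrative.

```python
import math
import random
from collections import deque

def prob(arm, t):
    """Success probabilities of the first example (arms indexed 0, 1, 2)."""
    if arm == 0:
        return 0.5
    if arm == 1:
        return 0.3
    return 0.9 if 3000 <= t < 5000 else 0.4

def sw_ucb(T, tau, xi=0.5, B=1.0, seed=0):
    """Run an assumed form of SW-UCB; return (pulls per arm, pseudo-regret)."""
    rng = random.Random(seed)
    K = 3
    window = deque()            # (arm, reward) pairs of the last tau plays
    counts = [0] * K            # plays of each arm inside the window
    sums = [0.0] * K            # rewards of each arm inside the window
    pulls = [0] * K
    regret = 0.0
    for t in range(1, T + 1):
        unseen = [i for i in range(K) if counts[i] == 0]
        if unseen:
            arm = unseen[0]     # play any arm absent from the window
        else:
            log_term = math.log(min(t, tau))
            arm = max(range(K), key=lambda i: sums[i] / counts[i]
                      + B * math.sqrt(xi * log_term / counts[i]))
        reward = 1.0 if rng.random() < prob(arm, t) else 0.0
        window.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward
        if len(window) > tau:   # forget the play that left the window
            old_arm, old_reward = window.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward
        pulls[arm] += 1
        regret += max(prob(i, t) for i in range(K)) - prob(arm, t)
    return pulls, regret

pulls, regret = sw_ucb(T=10_000, tau=607)
print(pulls, round(regret, 1))
```

Because only the last `tau` plays enter the index, observations made before a breakpoint are forgotten after at most `tau` steps, which is the mechanism behind the fast reaction discussed above.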

In the second example, there are $K = 2$ arms; the rewards are still Bernoulli random variables with
parameters $p_t(i)$ but are in persistent, continuous evolution. Arm 2 is taken as a reference ($p_t(2) = 1/2$
for all $t$), and the parameter of arm 1 evolves periodically as $p_t(1) = 0.5 + 0.4 \cos(6\pi R t/T)$. Hence, the
best arm to pull evolves cyclically and the transitions are smooth (regularly, the two arms are statistically
indistinguishable). The middle plot in the right panel of Figure 1 represents the cumulative frequency of arm
1 pulls: D-UCB, SW-UCB and, to a lesser extent, EXP3.S track the cycles, while UCB-1 fails to identify the
best current arm. Below, the evolutions of the cumulative regrets under the four policies are shown: in this
continuously evolving environment, the performances of D-UCB and SW-UCB are almost equivalent while
the UCB-1 and Exp3.S algorithms accumulate larger regrets.
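
A discounted policy for this periodic setting can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the index is taken to be "discounted empirical mean plus $2B\sqrt{\xi \log n_t(\gamma)/N_t(\gamma, i)}$" with $n_t(\gamma)$ the sum of the discounted counts, the constant $R$ in the cosine is set to 1 because its value is not given in this section, and arms are indexed from 0. The names `periodic_prob` and `d_ucb` are hypothetical.

```python
import math
import random

def periodic_prob(t, T, R=1.0):
    """p_t(1) of the second example; R = 1 is an assumption."""
    return 0.5 + 0.4 * math.cos(6.0 * math.pi * R * t / T)

def d_ucb(T, gamma, xi=0.5, B=1.0, seed=0):
    """Run an assumed form of D-UCB; return (pulls per arm, pseudo-regret)."""
    rng = random.Random(seed)
    N = [0.0, 0.0]              # discounted number of plays
    S = [0.0, 0.0]              # discounted sum of rewards
    pulls = [0, 0]
    regret = 0.0
    for t in range(1, T + 1):
        probs = [periodic_prob(t, T), 0.5]
        total = N[0] + N[1]
        log_n = math.log(total) if total > 1.0 else 0.0

        def index(i):
            if N[i] < 1e-12:    # arm (almost) never seen: force a play
                return float("inf")
            return S[i] / N[i] + 2.0 * B * math.sqrt(xi * log_n / N[i])

        arm = max(range(2), key=index)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        for i in range(2):      # discount all past observations
            N[i] *= gamma
            S[i] *= gamma
        N[arm] += 1.0
        S[arm] += reward
        pulls[arm] += 1
        regret += max(probs) - probs[arm]
    return pulls, regret

pulls, regret = d_ucb(T=10_000, gamma=0.9975)
print(pulls, round(regret, 1))
```

The recursive update keeps the cost per step constant, whereas the windowed variant has to store the last plays explicitly.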
6. Conclusion and perspectives

This paper theoretically establishes that the UCB policies can also be successfully adapted to cope with
non-stationary environments. The upper bound of the SW-UCB in abruptly changing environments matches
the upper bounds of the Exp3.S algorithm (i.e. $O(\sqrt{T \log(T)})$), showing that UCB policies can be at least
as good as the softmax methods. In practice, numerical experiments also support this finding. For the
two examples considered in this paper, the D-UCB and SW-UCB policies outperform the optimally tuned
version of the Exp3.S algorithm.

The focus of this paper is on abruptly changing environments, but it is believed that the theoretical tools
developed to handle the non-stationarity can be applied in different contexts. In particular, using a similar
bias-variance decomposition of the discounted or windowed rewards, the analysis of continuously evolving
reward distributions can be done (and will be reported in a forthcoming paper). Furthermore, Theorems 18
and 22, dealing with concentration inequalities for discounted martingale transforms, are powerful tools of
independent interest.

As with the previously reported Exp3.S algorithm, the performance of the proposed policies depends on tuning
parameters: the discount factor for D-UCB and the window size for SW-UCB. These tuning parameters may
be adaptively set, using data-driven approaches, such as the one proposed in Hartland et al. (2006). This is
the subject of on-going research.

Appendix A. A Hoeffding-type inequality for self-normalized means with a random number 
of summands 

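Let $(X_t)_{t \ge 1}$ be a sequence of non-negative independent bounded random variables defined on a probability
space $(\Omega, \mathcal{A}, \mathbb{P})$. We denote by $B$ the upper bound, $X_t \in [0, B]$, $\mathbb{P}$-a.s., and by $\mu_t$ its expectation, $\mu_t = \mathbb{E}[X_t]$.
Let $\mathcal{F}_t$ be an increasing sequence of $\sigma$-fields of $\mathcal{A}$ such that for each $t$, $\sigma(X_1, \dots, X_t) \subset \mathcal{F}_t$ and for $s > t$,
$X_s$ is independent from $\mathcal{F}_t$. Consider a previsible sequence $(\epsilon_t)_{t \ge 1}$ of Bernoulli variables (for all $t > 0$, $\epsilon_t$
is $\mathcal{F}_{t-1}$-measurable). Denote by $\phi_t$ the Cramér transform of $X_t$: for $\lambda \in \mathbb{R}$,

$$\phi_t(\lambda) = \log \mathbb{E}[\exp(\lambda X_t)] .$$

For $\gamma \in [0, 1)$, consider the following random variables

$$S_t(\gamma) = \sum_{s=1}^{t} \gamma^{t-s} X_s \epsilon_s , \qquad M_t(\gamma) = \sum_{s=1}^{t} \gamma^{t-s} \mu_s \epsilon_s , \qquad N_t(\gamma) = \sum_{s=1}^{t} \gamma^{t-s} \epsilon_s . \qquad (12)$$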



Let also

$$n_t(\gamma) = \sum_{s=1}^{t} \gamma^{t-s} .$$

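Theorem 18 For all integers $t$ and all $\delta > 0$,

$$\mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \le \left\lceil \frac{\log n_t(\gamma)}{\log(1 + \eta)} \right\rceil \exp\left( -\frac{2\delta^2}{B^2} \left( 1 - \frac{\eta^2}{16} \right) \right)$$

for all $\eta > 0$.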
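Remark 19 Actually, we prove the slightly stronger inequality:

$$\mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \le \left\lceil \frac{\log n_t(\gamma)}{\log(1 + \eta)} \right\rceil \exp\left( -\frac{8\delta^2}{B^2 \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right)^2} \right) . \qquad (13)$$
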



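Proof First observe that we can assume $\epsilon_t = 1$, since otherwise $(S_t(\gamma) - M_t(\gamma))/\sqrt{N_t(\gamma^2)} = (S_{t-1}(\gamma) - M_{t-1}(\gamma))/\sqrt{N_{t-1}(\gamma^2)}$ and the result follows from a simple induction. Second, note that for every positive
$\lambda$ and for every $u < t$, since $\epsilon_{u+1}$ is predictable, and since $X_{u+1}$ is independent from $\mathcal{F}_u$,

$$\mathbb{E}\left[ \exp(\lambda X_{u+1} \epsilon_{u+1}) \mid \mathcal{F}_u \right] = \exp\left( \phi_{u+1}(\lambda \epsilon_{u+1}) \right) = \exp\left( \phi_{u+1}(\lambda)\, \epsilon_{u+1} \right) .$$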

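Hence, as $S_{u+1}(\gamma) = \gamma S_u(\gamma) + X_{u+1} \epsilon_{u+1}$,

$$\mathbb{E}\left[ \exp\left( \lambda S_{u+1}(\gamma) - \sum_{s=1}^{u+1} \phi_s(\lambda \gamma^{u+1-s})\, \epsilon_s \right) \right] = \mathbb{E}\left[ \exp\left( \lambda \gamma S_u(\gamma) - \sum_{s=1}^{u} \phi_s\big( (\lambda \gamma) \gamma^{u-s} \big)\, \epsilon_s \right) \right] .$$

As $\phi_s(0) = 0$, this proves by induction that

$$\mathbb{E}\left[ \exp\left( \lambda S_t(\gamma) - \sum_{s=1}^{t} \phi_s(\lambda \gamma^{t-s})\, \epsilon_s \right) \right] = 1 .$$
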



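It is easily verified (see e.g. (Devroye et al., 1996, Lemma 8.1)) that under the stated assumptions, for all
positive $\lambda$,

$$\phi_s(\lambda) \le \lambda \mu_s + \frac{B^2 \lambda^2}{8} , \qquad (14)$$

showing that

$$\mathbb{E}\left[ \exp\left( \lambda \{ S_t(\gamma) - M_t(\gamma) \} - \frac{B^2}{8} \lambda^2 N_t(\gamma^2) \right) \right] \le 1 .$$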



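Hence, for any $x > 0$, the Markov inequality yields

$$\mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \frac{x}{\lambda \sqrt{N_t(\gamma^2)}} + \frac{B^2}{8} \lambda \sqrt{N_t(\gamma^2)} \right) = \mathbb{P}\left( \exp\left( \lambda (S_t(\gamma) - M_t(\gamma)) - \frac{B^2}{8} \lambda^2 N_t(\gamma^2) \right) > e^{x} \right) \le \exp(-x) .$$

Now, take $\eta > 0$, let

$$D = \left\lceil \frac{\log n_t(\gamma)}{\log(1 + \eta)} \right\rceil$$

and, for every integer $k \in \{1, \dots, D\}$, define

$$\lambda_k = \sqrt{\frac{8x}{B^2 (1 + \eta)^{k - 1/2}}} .$$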



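Elementary algebra shows that for all $z$ such that $(1 + \eta)^{k-1} \le z \le (1 + \eta)^{k}$, we have

$$\sqrt{\frac{(1 + \eta)^{k - 1/2}}{z}} + \sqrt{\frac{z}{(1 + \eta)^{k - 1/2}}} \le (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} . \qquad (15)$$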
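Thus, if $(1 + \eta)^{k-1} \le N_t(\gamma^2) \le (1 + \eta)^{k}$, then

$$\frac{x}{\lambda_k \sqrt{N_t(\gamma^2)}} + \frac{B^2}{8} \lambda_k \sqrt{N_t(\gamma^2)} = B \sqrt{\frac{x}{8}} \left( \sqrt{\frac{(1 + \eta)^{k - 1/2}}{N_t(\gamma^2)}} + \sqrt{\frac{N_t(\gamma^2)}{(1 + \eta)^{k - 1/2}}} \right) \le B \sqrt{\frac{x}{8}} \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right) .$$

Therefore, as $\epsilon_t = 1$ we have $1 \le N_t(\gamma^2) \le (1 + \eta)^{D}$ and

$$\left\{ \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > B \sqrt{\frac{x}{8}} \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right) \right\} \subset \bigcup_{k=1}^{D} \left\{ \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \frac{x}{\lambda_k \sqrt{N_t(\gamma^2)}} + \frac{B^2}{8} \lambda_k \sqrt{N_t(\gamma^2)} \right\} .$$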

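The union bound thus implies that:

$$\mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > B \sqrt{\frac{x}{8}} \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right) \right) \le \sum_{k=1}^{D} \mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \frac{x}{\lambda_k \sqrt{N_t(\gamma^2)}} + \frac{B^2}{8} \lambda_k \sqrt{N_t(\gamma^2)} \right) \le D \exp(-x) .$$

For $\delta = B \sqrt{x/8} \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right)$, this yields

$$\mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \le D \exp\left( -\frac{8\delta^2}{B^2 \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right)^2} \right) .$$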

The conclusion follows, as it is easy to see that, for all $\eta > 0$,

$$\frac{4}{\left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right)^2} \ge 1 - \frac{\eta^2}{16} . \qquad (16)$$
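Remark 20 For example, taking $\eta = 0.3$ in (13) yields

$$\mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \le \left\lceil 4 \log n_t(\gamma) \right\rceil \exp\left( -\frac{1.99\, \delta^2}{B^2} \right) .$$

Classical Hoeffding bounds for deterministic $\epsilon_s$ yield an upper-bound in

$$\mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \le \exp\left( -\frac{2\delta^2}{B^2} \right)$$

for all positive $t$. The factor behind the exponential and the very slightly larger exponent are the price to
pay for the presence of random $\epsilon_s$. Theorem 18 is maybe sub-optimal, but it is possible to show that for all
$\delta > 0$ and for an appropriate choice of the previsible sequence $(\epsilon_s)_{s \ge 1}$,

$$\mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \to 1$$

as $t$ goes to infinity.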
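
The numerical constants quoted for $\eta = 0.3$ can be checked directly. The short Python sketch below assumes the reading of inequality (13) in which the exponent is $8\delta^2 / (B^2 ((1+\eta)^{1/4} + (1+\eta)^{-1/4})^2)$ and the prefactor is $\lceil \log n_t(\gamma)/\log(1+\eta) \rceil$; it also verifies on a grid the elementary inequality $4/((1+\eta)^{1/4} + (1+\eta)^{-1/4})^2 \ge 1 - \eta^2/16$ used at the end of the proof. The function names are illustrative.

```python
import math

def exponent_constant(eta):
    """Constant c such that the bound reads exp(-c * delta^2 / B^2)."""
    a = (1.0 + eta) ** 0.25
    return 8.0 / (a + 1.0 / a) ** 2

def prefactor_constant(eta):
    """Constant multiplying log n_t(gamma) inside the ceiling."""
    return 1.0 / math.log(1.0 + eta)

def check_final_inequality(etas):
    """Check 4 / (a + 1/a)^2 >= 1 - eta^2 / 16 for each eta in etas."""
    return all(exponent_constant(eta) / 2.0 >= 1.0 - eta ** 2 / 16.0
               for eta in etas)

print(exponent_constant(0.3), prefactor_constant(0.3))
print(check_final_inequality([k / 100.0 for k in range(1, 1001)]))
```

For $\eta = 0.3$ the exponent constant lies just above 1.99 and the prefactor constant is below 4, in line with the rounded values in the remark.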



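If all variables $X_t$ have the same expectation $\mu$, taking $\gamma = 1$ in Theorem 18 immediately leads to the
following corollary:

Corollary 21 For all integers $\tau$ and $t$,

$$\mathbb{P}\left( \frac{\sum_{s=(t-\tau+1) \vee 1}^{t} (X_s - \mu)\, \epsilon_s}{\sqrt{\sum_{s=(t-\tau+1) \vee 1}^{t} \epsilon_s}} > \delta \right) \le \left\lceil \frac{\log(\tau \wedge t)}{\log(1 + \eta)} \right\rceil \exp\left( -\frac{2\delta^2}{B^2} \left( 1 - \frac{\eta^2}{16} \right) \right) .$$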
Appendix B. A maximal inequality for self-normalized means with a random number of
summands 

In this section, we prove a stronger version of Theorem 18: we upper-bound the probability that, at some 
time t, the average reward deviates from its expectation. We keep the same notations as in Section A. 



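Theorem 22 For all positive integers $T$ and all $\delta > 0$,

$$\mathbb{P}\left( \sup_{1 \le t \le T} \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \le \left\lceil \frac{\log\left( \gamma^{-2T} n_T(\gamma^2) \right)}{\log(1 + \eta)} \right\rceil \exp\left( -\frac{2\delta^2}{B^2} \left( 1 - \frac{\eta^2}{16} \right) \right)$$

for all $\eta > 0$.

Remark 23 Note that if $\gamma < 1$, then

$$\log\left( \gamma^{-2T} n_T(\gamma^2) \right) \le \frac{2T(1 - \gamma)}{\gamma} + \log \frac{1}{1 - \gamma^2} ,$$

while for $\gamma = 1$ we have:

$$\log\left( \gamma^{-2T} n_T(\gamma^2) \right) = \log T .$$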
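
The bound of Remark 23 is easy to check numerically. The sketch below assumes $n_T(\gamma^2) = \sum_{s=1}^{T} \gamma^{2(T-s)} = (1 - \gamma^{2T})/(1 - \gamma^2)$ for $\gamma < 1$ and works in the log domain to avoid overflow of $\gamma^{-2T}$; the function names are illustrative.

```python
import math

def log_factor(gamma, T):
    """log(gamma^(-2T) * n_T(gamma^2)) for 0 < gamma < 1, in log domain."""
    n_T = (1.0 - gamma ** (2 * T)) / (1.0 - gamma ** 2)
    return -2.0 * T * math.log(gamma) + math.log(n_T)

def remark23_bound(gamma, T):
    """Right-hand side 2T(1 - gamma)/gamma + log(1 / (1 - gamma^2))."""
    return 2.0 * T * (1.0 - gamma) / gamma - math.log(1.0 - gamma ** 2)

for gamma in (0.5, 0.9, 0.99, 0.9975):
    for T in (10, 1000, 10_000):
        assert log_factor(gamma, T) <= remark23_bound(gamma, T)
print(log_factor(0.9975, 10_000), remark23_bound(0.9975, 10_000))
```

The inequality follows from $-\log \gamma \le (1 - \gamma)/\gamma$ and $1 - \gamma^{2T} \le 1$, so the check is only a sanity test of the formula as read here.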
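Remark 24 Classical Hoeffding bounds for deterministic $\epsilon_s$ yield an upper-bound in

$$\mathbb{P}\left( \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \le \exp\left( -\frac{2\delta^2}{B^2} \right)$$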



for all positive t. The factor behind the exponential (depending on T and e) and the very slightly larger 
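exponent are the price to pay for uniformity in $t$. For example, taking $\eta = 0.3$ yields

$$\mathbb{P}\left( \sup_{1 \le t \le T} \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \le \left\lceil 4 \log\left( \gamma^{-2T} n_T(\gamma^2) \right) \right\rceil \exp\left( -\frac{1.99\, \delta^2}{B^2} \right) .$$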
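Proof For $\lambda > 0$, define

$$Z_t^{\lambda} = \exp\left( \lambda \gamma^{-t} S_t(\gamma) - \sum_{s=1}^{t} \epsilon_s \phi_s(\lambda \gamma^{-s}) \right) . \qquad (17)$$

Note that

$$\mathbb{E}\left[ \exp(\lambda \gamma^{-t} X_t \epsilon_t) \mid \mathcal{F}_{t-1} \right] = \exp\left( \epsilon_t \phi_t(\lambda \gamma^{-t}) \right) .$$

Since $\gamma^{-t} S_t(\gamma) = \gamma^{-(t-1)} S_{t-1}(\gamma) + \gamma^{-t} X_t \epsilon_t$, we may therefore write

$$\mathbb{E}\left[ \exp(\lambda \gamma^{-t} S_t(\gamma)) \mid \mathcal{F}_{t-1} \right] = \exp\left( \lambda \gamma^{-(t-1)} S_{t-1}(\gamma) \right) \exp\left( \epsilon_t \phi_t(\lambda \gamma^{-t}) \right) ,$$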

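showing that $\{Z_t^{\lambda}\}$ is a martingale adapted to the filtration $\mathcal{F} = \{\mathcal{F}_t, t \ge 0\}$. As already mentioned (see
e.g. (Devroye et al., 1996, Lemma 8.1)), under the stated assumptions

$$\phi_t(\lambda) \le \lambda \mu_t + B^2 \lambda^2 / 8 ,$$

showing that for all $\lambda > 0$,

$$W_t^{\lambda} = \exp\left( \lambda \gamma^{-t} S_t(\gamma) - \lambda \gamma^{-t} M_t(\gamma) - (B^2/8) \lambda^2 \gamma^{-2t} N_t(\gamma^2) \right) \qquad (18)$$

is a super-martingale. Hence, for any $x > 0$ we have

$$\mathbb{P}\left( \sup_{1 \le t \le T} W_t^{\lambda} \ge \exp(x) \right) \le \exp(-x) . \qquad (19)$$
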

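On the other hand, note that

$$W_t^{\lambda} > \exp(x) \iff \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \frac{x}{\lambda \gamma^{-t} \sqrt{N_t(\gamma^2)}} + \frac{B^2}{8} \lambda \gamma^{-t} \sqrt{N_t(\gamma^2)} . \qquad (20)$$

Now, let

$$D = \left\lceil \frac{\log\left( \gamma^{-2T} n_T(\gamma^2) \right)}{\log(1 + \eta)} \right\rceil$$

and for every integer $k \in \{1, \dots, D\}$, define

$$\lambda_k = \sqrt{\frac{8x}{B^2 (1 + \eta)^{k - 1/2}}} .$$

Thus, if $(1 + \eta)^{k-1} \le \gamma^{-2t} N_t(\gamma^2) \le (1 + \eta)^{k}$, then using Equation (15) yields:

$$\frac{x}{\lambda_k \gamma^{-t} \sqrt{N_t(\gamma^2)}} + \frac{B^2}{8} \lambda_k \gamma^{-t} \sqrt{N_t(\gamma^2)} \le B \sqrt{\frac{x}{8}} \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right) ,$$

which proves, using Equation (20), that

$$\left\{ \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > B \sqrt{\frac{x}{8}} \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right) \right\} \subset \left\{ W_t^{\lambda_k} > \exp(x) \right\} .$$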
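But as

$$\sup_{1 \le t \le T} \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} = \sup_{1 \le t \le T,\ \epsilon_t = 1} \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} ,$$

we assume $\epsilon_t = 1$ and thus $1 \le \gamma^{-2t} N_t(\gamma^2) \le (1 + \eta)^{D}$. Hence, thanks to Equation (19) we obtain:

$$\mathbb{P}\left( \sup_{1 \le t \le T} \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > B \sqrt{\frac{x}{8}} \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right) \right) \le \mathbb{P}\left( \bigcup_{1 \le t \le T} \bigcup_{1 \le k \le D} \left\{ W_t^{\lambda_k} > \exp(x) \right\} \right) = \mathbb{P}\left( \bigcup_{1 \le k \le D} \left\{ \sup_{1 \le t \le T} W_t^{\lambda_k} > \exp(x) \right\} \right) \le D \exp(-x) .$$

For

$$x = \frac{8\delta^2}{B^2 \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right)^2}$$

and using Equation (16), this yields

$$\mathbb{P}\left( \sup_{1 \le t \le T} \frac{S_t(\gamma) - M_t(\gamma)}{\sqrt{N_t(\gamma^2)}} > \delta \right) \le D \exp\left( -\frac{8\delta^2}{B^2 \left( (1 + \eta)^{1/4} + (1 + \eta)^{-1/4} \right)^2} \right) \le D \exp\left( -\frac{2\delta^2}{B^2} \left( 1 - \frac{\eta^2}{16} \right) \right) .$$
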

Appendix C. Technical results 

Lemma 25 For any $i \in \{1, \dots, K\}$ and for any positive integer $\tau$, let $N_{t-\tau:t}(1, i) = \sum_{s=t-\tau+1}^{t} 1_{\{I_s = i\}}$.
Then for any positive $m$,

$$\sum_{t=K+1}^{T} 1_{\{I_t = i,\ N_{t-\tau:t}(1, i) < m\}} \le \lceil T/\tau \rceil\, m .$$

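Proof

$$\sum_{t=1}^{T} 1_{\{I_t = i,\ N_{t-\tau:t}(1, i) < m\}} \le \sum_{j=1}^{\lceil T/\tau \rceil} \sum_{t=(j-1)\tau+1}^{j\tau} 1_{\{I_t = i,\ N_{t-\tau:t}(1, i) < m\}} .$$

For any given $j \in \{1, \dots, \lceil T/\tau \rceil\}$, either $\sum_{t=(j-1)\tau+1}^{j\tau} 1_{\{I_t = i,\ N_{t-\tau:t}(1, i) < m\}} = 0$ or there exists an index
$t \in \{(j-1)\tau + 1, \dots, j\tau\}$ such that $I_t = i$, $N_{t-\tau:t}(1, i) < m$. In this case, we put $t_j = \max\{t \in \{(j-1)\tau + 1, \dots, j\tau\} : I_t = i,\ N_{t-\tau:t}(1, i) < m\}$, the last time this condition is met in the $j$-th block.
Then,

$$\sum_{t=(j-1)\tau+1}^{j\tau} 1_{\{I_t = i,\ N_{t-\tau:t}(1, i) < m\}} = \sum_{t=(j-1)\tau+1}^{t_j} 1_{\{I_t = i,\ N_{t-\tau:t}(1, i) < m\}} \le \sum_{t=t_j-\tau+1}^{t_j} 1_{\{I_t = i,\ N_{t-\tau:t}(1, i) < m\}} \le \sum_{t=t_j-\tau+1}^{t_j} 1_{\{I_t = i\}} = N_{t_j-\tau:t_j}(1, i) < m .$$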
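
The counting argument of Lemma 25 can be checked by brute force on random sequences of arm choices. The sketch below reads the bound as $\lceil T/\tau \rceil\, m$ and sums from $t = 1$, as in the proof, which can only make the count larger than the sum from $t = K + 1$; the function name `lemma25_count` and the uniformly random choice of arms are illustrative assumptions.

```python
import math
import random

def lemma25_count(arms, i, tau, m):
    """Number of t with I_t = i and fewer than m plays of i in the window.

    arms[t - 1] is the arm I_t played at time t (t = 1, ..., T); the
    window of time t covers the times t - tau + 1, ..., t.
    """
    count = 0
    for t in range(1, len(arms) + 1):
        window = arms[max(0, t - tau):t]
        if arms[t - 1] == i and window.count(i) < m:
            count += 1
    return count

rng = random.Random(0)
T, K = 500, 3
for _ in range(200):
    arms = [rng.randrange(K) for _ in range(T)]
    tau = rng.randint(1, 60)
    m = rng.randint(1, 10)
    for i in range(K):
        assert lemma25_count(arms, i, tau, m) <= math.ceil(T / tau) * m
```

Each block of `tau` consecutive time steps contributes fewer than `m` terms, which is exactly what the assertion tests.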



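Corollary 26 For any $i \in \{1, \dots, K\}$, any integers $\tau \ge 1$ and $A > 0$,

$$\sum_{t=K+1}^{T} 1_{\{I_t = i,\ N_t(\gamma, i) < A\}} \le \lceil T/\tau \rceil\, A\, \gamma^{-\tau} .$$

Proof Simply note that

$$\sum_{t=K+1}^{T} 1_{\{I_t = i,\ N_t(\gamma, i) < A\}} \le \sum_{t=K+1}^{T} 1_{\{I_t = i,\ N_{t-\tau:t}(1, i) < \gamma^{-\tau} A\}} \qquad (21)$$

and apply the preceding lemma with $m = \gamma^{-\tau} A$.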
Figure 1: Left panel: Bernoulli MAB problem with two swaps. Upper: evolution of the probability of
having a reward 1 for each arm; Middle: cumulative frequency of arm 1 pulls for each policy.
Below: cumulative regret of each policy. Right panel: Bernoulli MAB problem with periodic
rewards. Upper: evolution of the probability of having a reward 1 for arm 1 (and time intervals
when it should be played); Middle: cumulative frequency of arm 1 pulls for each policy. Below:
cumulative regret of each policy.